Introduction

This project is a tutorial on multivariate time series analysis for the NIFTY 50 stock market index from the NSE (National Stock Exchange) of India. The data contain the price history and trading volumes of fifty Indian stocks from 2000-01-03 to 2020-09-30.

We illustrate how to use Python, R, and Stata to apply the Auto Regressive Integrated Moving Average (ARIMA) model to time series data. ARIMA can fit a non-seasonal, non-stationary time series based on its lagged values.

A general ARIMA model consists of three parts: the “AR” part means the variable of interest is regressed on its lag terms, the “I” part means the differenced values are used, and the “MA” part means the regression error is modelled as a linear combination of error terms in the past. The purpose of using differenced terms is to make the time series stationary for autoregression.

An ARIMA model is characterized by 3 terms: p (the order of the AR term), q (the order of the MA term), and d (the order of differencing needed to make the time series stationary). Given a time series \(\{X_t\}\), an \(ARIMA(p, d, q)\) model can be expressed as: \[(1-\sum_{i=1}^p\phi_iL^i)(1-L)^dX_t= (1+\sum_{i=1}^q\theta_iL^i)\epsilon_t + \delta\] where \(\epsilon_t\) is the error term, \(L\) is the lag operator, i.e. \(LX_t = X_{t-1}, \forall t>1\), \(p\) is the number of lagged terms of \(X\), \(d\) is the number of differences needed for stationarity, \(q\) is the number of lagged forecast errors in prediction, \(\delta\) is the intercept term for the regression, and the \(\theta, \phi\)’s are the estimated regression coefficients.
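To see why the \((1-L)^d\) differencing term induces stationarity, consider a toy series with a pure linear trend: a single difference removes the trend entirely. A minimal illustrative sketch (made-up numbers, not the tutorial data):

```python
import numpy as np

# A toy non-stationary series with a deterministic linear trend: X_t = 3t + 5
t = np.arange(10)
x = 3 * t + 5

# One application of (1 - L): X_t - X_{t-1}
dx = np.diff(x)
print(dx)  # [3 3 3 3 3 3 3 3 3] -- the trend is gone, the series is constant
```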

ARIMA models are fitted to better understand the data and to forecast future values. They build on linear regression, and the best model can be chosen using information criteria such as AIC or BIC.
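As a quick illustration of AIC-based selection: AIC = 2k − 2 ln L, where k is the number of estimated parameters and L the maximized likelihood, and the model with the lower AIC is preferred. The log-likelihoods below are hypothetical placeholders, not values from the NIFTY data:

```python
def aic(loglik: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2*ln(L). Lower is better."""
    return 2 * n_params - 2 * loglik

# Hypothetical log-likelihoods for two candidate fits (placeholder numbers)
aic_110 = aic(loglik=-13350.0, n_params=2)  # e.g. ARIMA(1,1,0): phi_1, sigma^2
aic_111 = aic(loglik=-13340.5, n_params=3)  # e.g. ARIMA(1,1,1): phi_1, theta_1, sigma^2
print(aic_110, aic_111)  # 26704.0 26687.0 -> the second model is preferred
```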

Data Description

The NIFTY 50 data cover 50 stocks, with 230,104 observations on 15 variables. The data contain the daily open, close, high, and low prices, volume, and other relevant information for the “Nifty Fifty” stocks since January 2000. Detailed variable descriptions are shown in Table 1 below.

Table 1. Variable descriptions of NIFTY 50 data
Variable Name        Variable Description
Date                 Date of trade
Symbol               Name of the company
Series               We have only one series: Equity (EQ)
Prev Close           Previous day’s close price
Open                 Open price of the day
High                 Highest price of the day
Low                  Lowest price of the day
Last                 Last traded price of the day
Close                Close price of the day
VWAP                 Volume Weighted Average Price
Volume               Number of shares traded in the day
Turnover             Total value of shares traded in the day
Trades               Number of trades executed in the day
Deliverable Volume   Shares actually transferred between demat accounts
%Deliverable         Percentage of traded volume that was deliverable
Return               Return of trade (Close minus Prev Close)

Since we are mainly interested in stock prices, we mostly use the variable VWAP, which summarizes the average price of a stock over a trading day. We want to capture the trend of stock prices across the years and, if possible, forecast future prices.
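For reference, a day's VWAP is the total traded value divided by the total traded volume. A toy sketch with made-up trades (the dataset ships VWAP precomputed, so this is purely illustrative):

```python
import pandas as pd

# Hypothetical intraday trades for one stock on one day (made-up numbers)
trades = pd.DataFrame({
    "price":  [100.0, 101.0, 99.0],
    "volume": [200,   100,   100],
})

# VWAP = total traded value / total traded volume
vwap = (trades["price"] * trades["volume"]).sum() / trades["volume"].sum()
print(vwap)  # 100.0
```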

Data Cleaning

Python

Firstly, we import the data.

import sys
sys.path.append('/usr/local/lib/python3.8/site-packages')
# Just to make sure the packages are loaded.
import pandas as pd
import numpy as np
df = pd.read_csv('./NIFTY50_all.csv')

The variable Trades has about 50% missing values, so we delete it. We also delete redundant variables and impute %Deliverable with simple imputation, since it has relatively few missing values (about 7%). Finally, we need to merge the different symbols used for the same stock over time.
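The per-column missingness fractions that motivate these choices can be checked with pandas' `isna().mean()`; here is a sketch on a small made-up frame, since the percentages quoted above come from the full data:

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the NIFTY data (fractions are illustrative)
toy = pd.DataFrame({
    "Trades":      [np.nan, 10.0, np.nan, np.nan],  # heavily missing -> drop
    "%Deliverble": [0.5, np.nan, 0.7, 0.6],         # lightly missing -> impute
})

# Fraction of missing values in each column
print(toy.isna().mean())  # Trades 0.75, %Deliverble 0.25
```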

from datetime import datetime # Dealing with date format data
from sklearn.impute import SimpleImputer # Impute data
df['Date'] = [datetime.strptime(x, '%Y-%m-%d') for x in df['Date']] # Parse dates
# Drop redundant variables and variables with too many missing values
df1 = df.drop(['Trades', 'Deliverable Volume', 'Series'], axis=1)
ls1 = ['MUNDRAPORT', 'UTIBANK', 'BAJAUTOFIN', 'BHARTI', 'HEROHONDA',
       'HINDALC0', 'HINDLEVER', 'INFOSYSTCH', 'JSWSTL', 'KOTAKMAH', 'TELCO',
       'TISCO', 'UNIPHOS', 'SESAGOA', 'SSLT', 'ZEETELE']
ls2 = ['ADANIPORTS', 'AXISBANK', 'BAJFINANCE', 'BHARTIARTL', 'HEROMOTOCO',
       'HINDALCO', 'HINDUNILVR', 'INFY', 'JSWSTEEL', 'KOTAKBANK', 'TATAMOTORS',
       'TATASTEEL', 'UPL', 'VEDL', 'VEDL', 'ZEEL']
df1['Symbol'] = df1['Symbol'].replace(ls1, ls2)
df1['Symbol'] = pd.Categorical(df1['Symbol'])
# Impute missing values: one-hot encode Symbol, convert dates to ordinals, impute
df2 = pd.get_dummies(data=df1, drop_first=True)
df2['Date'] = df2['Date'].map(datetime.toordinal)
imp = SimpleImputer()
p = imp.fit_transform(df2)
# Write the imputed column back (column 10 of df2 is '%Deliverble')
df1['%Deliverble'] = p[:, 10]

R

Before conducting core analysis, let’s clean our data and check basic data structure.

colnames(nifty)[colSums(is.na(nifty)) > 0]
## [1] "Trades"             "Deliverable Volume" "%Deliverable"

As we can see, the variables Trades, Deliverable Volume, and %Deliverable have missing values, which we will replace with 0. We also found that some stocks changed their names during the 2000–2020 period, so we need to bring their names into accord.

library(dplyr)
library(ggplot2)

# change old stock names to new
old_name = c('MUNDRAPORT', 'UTIBANK', 'BAJAUTOFIN', 'BHARTI', 'HEROHONDA',
       'HINDALC0', 'HINDLEVER', 'INFOSYSTCH', 'JSWSTL', 'KOTAKMAH', 'TELCO',
       'TISCO', 'UNIPHOS', 'SESAGOA', 'SSLT', 'ZEETELE')
new_name = c('ADANIPORTS', 'AXISBANK', 'BAJFINANCE', 'BHARTIARTL', 'HEROMOTOCO',
       'HINDALCO', 'HINDUNILVR', 'INFY', 'JSWSTEEL', 'KOTAKBANK', 'TATAMOTORS',
       'TATASTEEL', 'UPL', 'VEDL', 'VEDL', 'ZEEL')

nifty$Symbol = plyr::mapvalues(nifty$Symbol, from = old_name, to = new_name)

# summary statistics of variable of interest
nifty_clean = nifty %>%
  replace(is.na(.), 0) %>% 
  mutate(Return = Close - `Prev Close`) %>% 
  select(Date, Symbol, VWAP, Volume, Turnover, Return)
summary(nifty_clean)
##       Date               Symbol               VWAP              Volume         
##  Min.   :2000-01-03   Length:230104      Min.   :    9.21   Min.   :        3  
##  1st Qu.:2006-05-23   Class :character   1st Qu.:  272.43   1st Qu.:   209105  
##  Median :2011-06-07   Mode  :character   Median :  554.00   Median :   973757  
##  Mean   :2011-02-20                      Mean   : 1224.17   Mean   :  2745270  
##  3rd Qu.:2016-02-05                      3rd Qu.: 1213.47   3rd Qu.:  2836929  
##  Max.   :2020-09-30                      Max.   :32975.24   Max.   :481058927  
##     Turnover             Return         
##  Min.   :1.047e+07   Min.   :-19525.95  
##  1st Qu.:1.520e+13   1st Qu.:    -5.80  
##  Median :6.400e+13   Median :     0.05  
##  Mean   :1.433e+14   Mean   :     0.25  
##  3rd Qu.:1.710e+14   3rd Qu.:     6.40  
##  Max.   :3.560e+16   Max.   :  2107.70

Stata

Firstly, we import the data and do some data cleaning. We drop the variables Series, Trades, and Deliverable Volume, as well as the original string-valued date. We also convert the date from string type to Stata's date type so the whole data set can be treated as a time series.

import delimited NIFTY50_all.csv, clear

* Data Cleaning
gen date2 = date(date, "YMD")
format date2 %tdCCYY-nn-dd
drop date series
drop trades deliverablevolume
rename date2 date
label variable date "Date"

* Replace Symbol Names
replace symbol = "ADANIPORTS" if symbol == "MUNDRAPORT"
replace symbol = "AXISBANK" if symbol == "UTIBANK"
replace symbol = "BAJFINANCE" if symbol == "BAJAUTOFIN"
replace symbol = "BHARTIARTL" if symbol == "BHARTI"
replace symbol = "HEROMOTOCO" if symbol == "HEROHONDA"
replace symbol = "HINDALCO" if symbol == "HINDALC0"
replace symbol = "HINDUNILVR" if symbol == "HINDLEVER"
replace symbol = "INFY" if symbol == "INFOSYSTCH"
replace symbol = "JSWSTEEL" if symbol == "JSWSTL"
replace symbol = "KOTAKBANK" if symbol == "KOTAKMAH"
replace symbol = "TATAMOTORS" if symbol == "TELCO"
replace symbol = "TATASTEEL" if symbol == "TISCO"
replace symbol = "UPL" if symbol == "UNIPHOS"
replace symbol = "VEDL" if symbol == "SESAGOA"
replace symbol = "VEDL" if symbol == "SSLT"
replace symbol = "ZEEL" if symbol == "ZEETELE"

* Save the cleaned data
save NIFTY_clean, replace 

Core Analysis

Python

Set up

The main packages used in the Python tutorial are:

  • pandas: For data cleaning and data frame mutations
  • numpy: For numeric data processing
  • datetime: For date format data processing
  • sklearn: For missing data imputing
  • statsmodels and pmdarima: For the main ARIMA model
  • matplotlib: For making plots

Data visualization

We can take a look at the data. Take the stock “ADANIPORTS” as an example.

import matplotlib.pyplot as plt # Plotting package in python
names = df1['Symbol'].cat.categories
example = df1[df1['Symbol'] == names[0]]
fig, ax = plt.subplots(3, 1, figsize=(8, 8))
ax[0].plot(example['Date'], example['VWAP'])
ax[0].set_xticks([])
ax[0].set_xlabel('Days')
ax[0].set_ylabel('Volume weighted average price')
ax[1].plot(example['Date'], example['Volume'])
ax[1].set_xticks([])
ax[1].set_xlabel('Days')
ax[1].set_ylabel('Volume')
ax[2].plot(example['Date'], example['Turnover'])
ax[2].set_xticks([])
ax[2].set_xlabel('Days')
ax[2].set_ylabel('Turnover')
ax[0].set_title('Time series plots of stock %s' % names[0])

Figure 2.1. Time series plots of example stocks

Determine model parameters

The Python tutorial uses the VWAP time series for the analysis below.

The differencing parameter \(d\) of the model can be determined by doing Augmented Dickey-Fuller tests, which can indicate whether the time series are stationary. See the reference for more details about ADF tests. The python packages statsmodels.tsa and pmdarima.arima are very helpful here.

from statsmodels.tsa.arima_model import ARIMA 
from pmdarima.arima.utils import ndiffs # ARIMA model packages
# This function chooses the smallest d for the series to be stationary
names = df1['Symbol'].cat.categories
ls0 = []
for i in names:
    subdf = df1[df1['Symbol'] == i]
    # Select the rows for stock i
    ls0.append(ndiffs(subdf['VWAP'], test='adf'))
ls0  # Most values are 1
## [0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
max(ls0) # We don't need 2-order differencing
## 1

In order to ensure all time series are stationary, we choose \(d=1\) for all stocks.

The AR parameter \(p\) of the model can be determined by looking at Partial Autocorrelation plots. These plots indicate the correlation between the series and its lag. We use the first 4 stocks alphabetically as a sample. Notice the series need to be differenced first (\(d=1\)).

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, ax = plt.subplots(1, 4, figsize=(15, 5))
# Take the first 4 stocks as a sample
for i in range(4):
    subdf = df1[df1['Symbol'] == names[i]]
    # Select the rows for stock number i
    plot_pacf(subdf['VWAP'].diff().dropna(), ax=ax[i])
    ax[i].set_title('PACF plot for stock %s' % names[i])

Figure 2.2. PACF plots for choosing p

Notice lag 1 is at least borderline significant in the plots for all stocks, but lag 2 is not significant. Therefore, we can choose \(p=1\) for all stocks.

The MA parameter \(q\) of the model can be determined by looking at Autocorrelation (ACF) plots. We use the next 4 stocks alphabetically as a sample. Similarly, the series need to be differenced (\(d=1\)).

fig, ax = plt.subplots(1, 4, figsize=(15, 5))
for i in range(4):
    subdf = df1[df1['Symbol'] == names[i+4]]
    plot_acf(subdf['VWAP'].diff().dropna(), ax=ax[i])
    ax[i].set_title('ACF plot for stock %s' % names[i+4])

Figure 2.3. ACF plots for choosing q

Notice lag 1 is significant for most of the stocks but lag 2 is not. Therefore we can choose \(q=1\) for the MA term.
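The bars in these ACF plots are sample autocorrelations, i.e. lag-k autocovariances normalized by the lag-0 value. A small numpy sketch of the idea (using one simple normalization convention, which differs slightly from statsmodels' at the boundaries):

```python
import numpy as np

def acf_at_lag(x, k):
    """Sample autocorrelation at lag k: lag-k autocovariance over the variance."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    c0 = np.mean(x * x)
    ck = np.mean(x[:-k] * x[k:]) if k > 0 else c0
    return ck / c0

# A perfectly alternating series is exactly anti-correlated at lag 1
alt = np.array([1.0, -1.0] * 50)
print(acf_at_lag(alt, 0), acf_at_lag(alt, 1))  # 1.0 -1.0
```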

Fit models

According to the process above, we choose the \(ARIMA(1, 1, 1)\) for all stocks. After fitting the models, we can view some of the summaries. Take the first 5 stocks alphabetically as an example.

mlist = [] # Models
flist = [] # Model fits
for i in names:
    subdf = df1[df1['Symbol'] == i]
    m = ARIMA(list(subdf['VWAP']), order=(1, 1, 1))
    mlist.append(m)
    flist.append(m.fit(disp=0))
for i in range(5):
    print(flist[i].summary())
##                              ARIMA Model Results                              
## ==============================================================================
## Dep. Variable:                    D.y   No. Observations:                 3178
## Model:                 ARIMA(1, 1, 1)   Log Likelihood              -13340.537
## Method:                       css-mle   S.D. of innovations             16.100
## Date:                Tue, 24 Nov 2020   AIC                          26689.074
## Time:                        11:39:50   BIC                          26713.330
## Sample:                             1   HQIC                         26697.773
##                                                                               
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const         -0.2048      0.314     -0.652      0.514      -0.820       0.411
## ar.L1.D.y      0.1049      0.151      0.694      0.488      -0.192       0.401
## ma.L1.D.y     -0.0156      0.152     -0.103      0.918      -0.313       0.282
##                                     Roots                                    
## =============================================================================
##                   Real          Imaginary           Modulus         Frequency
## -----------------------------------------------------------------------------
## AR.1            9.5325           +0.0000j            9.5325            0.0000
## MA.1           64.1530           +0.0000j           64.1530            0.0000
## -----------------------------------------------------------------------------
##                              ARIMA Model Results                              
## ==============================================================================
## Dep. Variable:                    D.y   No. Observations:                 5162
## Model:                 ARIMA(1, 1, 1)   Log Likelihood              -29071.318
## Method:                       css-mle   S.D. of innovations             67.549
## Date:                Tue, 24 Nov 2020   AIC                          58150.637
## Time:                        11:39:50   BIC                          58176.833
## Sample:                             1   HQIC                         58159.803
##                                                                               
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          0.3108      0.960      0.324      0.746      -1.570       2.192
## ar.L1.D.y     -0.1743      0.368     -0.473      0.636      -0.896       0.547
## ma.L1.D.y      0.1985      0.366      0.542      0.588      -0.519       0.916
##                                     Roots                                    
## =============================================================================
##                   Real          Imaginary           Modulus         Frequency
## -----------------------------------------------------------------------------
## AR.1           -5.7387           +0.0000j            5.7387            0.5000
## MA.1           -5.0380           +0.0000j            5.0380            0.5000
## -----------------------------------------------------------------------------
##                              ARIMA Model Results                              
## ==============================================================================
## Dep. Variable:                    D.y   No. Observations:                 5162
## Model:                 ARIMA(1, 1, 1)   Log Likelihood              -24320.029
## Method:                       css-mle   S.D. of innovations             26.908
## Date:                Tue, 24 Nov 2020   AIC                          48648.059
## Time:                        11:39:50   BIC                          48674.255
## Sample:                             1   HQIC                         48657.225
##                                                                               
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          0.0766      0.399      0.192      0.848      -0.706       0.859
## ar.L1.D.y     -0.0007      0.237     -0.003      0.998      -0.465       0.464
## ma.L1.D.y      0.0665      0.237      0.281      0.779      -0.397       0.530
##                                     Roots                                    
## =============================================================================
##                   Real          Imaginary           Modulus         Frequency
## -----------------------------------------------------------------------------
## AR.1        -1485.7047           +0.0000j         1485.7047            0.5000
## MA.1          -15.0293           +0.0000j           15.0293            0.5000
## -----------------------------------------------------------------------------
##                              ARIMA Model Results                              
## ==============================================================================
## Dep. Variable:                    D.y   No. Observations:                 3058
## Model:                 ARIMA(1, 1, 1)   Log Likelihood              -15798.297
## Method:                       css-mle   S.D. of innovations             42.406
## Date:                Tue, 24 Nov 2020   AIC                          31604.593
## Time:                        11:39:50   BIC                          31628.695
## Sample:                             1   HQIC                         31613.254
##                                                                               
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          0.7437      0.831      0.895      0.371      -0.885       2.372
## ar.L1.D.y     -0.2225      0.127     -1.758      0.079      -0.470       0.026
## ma.L1.D.y      0.3245      0.122      2.654      0.008       0.085       0.564
##                                     Roots                                    
## =============================================================================
##                   Real          Imaginary           Modulus         Frequency
## -----------------------------------------------------------------------------
## AR.1           -4.4950           +0.0000j            4.4950            0.5000
## MA.1           -3.0812           +0.0000j            3.0812            0.5000
## -----------------------------------------------------------------------------
##                              ARIMA Model Results                              
## ==============================================================================
## Dep. Variable:                    D.y   No. Observations:                 3057
## Model:                 ARIMA(1, 1, 1)   Log Likelihood              -17217.768
## Method:                       css-mle   S.D. of innovations             67.579
## Date:                Tue, 24 Nov 2020   AIC                          34443.535
## Time:                        11:39:50   BIC                          34467.636
## Sample:                             1   HQIC                         34452.196
##                                                                               
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const          1.7395      1.497      1.162      0.245      -1.195       4.674
## ar.L1.D.y     -0.1671      0.071     -2.350      0.019      -0.307      -0.028
## ma.L1.D.y      0.4298      0.066      6.537      0.000       0.301       0.559
##                                     Roots                                    
## =============================================================================
##                   Real          Imaginary           Modulus         Frequency
## -----------------------------------------------------------------------------
## AR.1           -5.9832           +0.0000j            5.9832            0.5000
## MA.1           -2.3266           +0.0000j            2.3266            0.5000
## -----------------------------------------------------------------------------

Notice that for the first 3 stocks the fitted coefficients are not significant, so the model fit is not good. We can consider removing the MA term, since it is non-significant for some stocks, i.e. choosing the \(ARIMA(1, 1, 0)\) model.

mlist0 = [] # Models
flist0 = [] # Model fits
for i in names:
    subdf = df1[df1['Symbol'] == i]
    m = ARIMA(list(subdf['VWAP']), order=(1, 1, 0))
    mlist0.append(m)
    flist0.append(m.fit(disp=0))

We use AIC as the model selection criterion. The AIC decreases for the first three stocks and increases for the 4th and 5th, indicating that different stocks need different models.

includema = [] # Whether MA term should be included
for i in range(50):
    includema.append(flist0[i].aic > flist[i].aic)
pd.value_counts(includema)
## True     28
## False    22
## dtype: int64

Model diagnostics

Firstly, we can use the in-sample lagged values to predict the time series. We can plot the prediction results for the first 5 stocks.

fig, ax = plt.subplots(5, figsize=(15, 20))
for i in range(5):
    if includema[i]: # ARIMA(1,1,1)
        flist[i].plot_predict(dynamic=False, ax=ax[i])
    else: # ARIMA(1,1,0)
        flist0[i].plot_predict(dynamic=False, ax=ax[i])
    ax[i].set_title(names[i])
fig.tight_layout()
fig

Figure 2.4. Prediction plots

We can also forecast future VWAPs using the chosen models. For example, we can forecast the average prices in the next 200 trading days after the time series.

fig, ax = plt.subplots(5, figsize=(15, 20))
for i in range(5):
    if includema[i]:
        forecast, b, ci = flist[i].forecast(200, alpha=0.05)
    else:
        forecast, b, ci = flist0[i].forecast(200, alpha=0.05)
    subdf = df1[df1['Symbol'] == names[i]]
    ax[i].plot(list(subdf['VWAP']))
    idx = range(len(subdf['VWAP']), 200+len(subdf['VWAP']))
    ax[i].plot(idx, forecast)
    ax[i].fill_between(idx, ci[:, 0], ci[:, 1], 
                 alpha=0.15)
    ax[i].set_title(names[i])
    ax[i].set_xticks([])
fig.tight_layout()
fig

Figure 2.5. Forecast plots

Notice the confidence intervals are very wide, indicating it’s not easy to forecast stock prices.

Model improvement

So far we have only chosen between two candidate models per stock. We can further improve the fit by selecting the most appropriate model for each stock individually.

We can use auto_arima to choose models: it compares candidate models and selects the best one. Again, we use the ADF test to determine \(d\), and AIC to determine \(p\) and \(q\). Take “ADANIPORTS”, “ASIANPAINT” and “BPCL” as examples.

import pmdarima as paim
subdf = df1[df1['Symbol'] == names[0]]
m1 = paim.auto_arima(subdf['VWAP'], start_p=1, start_q=1, test='adf',
                     max_p=3, max_q=3)
m1.summary() # The chosen model was ARIMA(1, 0, 1), which is a good fit.
## <class 'statsmodels.iolib.summary.Summary'>
## """
##                                SARIMAX Results                                
## ==============================================================================
## Dep. Variable:                      y   No. Observations:                 3179
## Model:               SARIMAX(1, 0, 1)   Log Likelihood              -13346.144
## Date:                Tue, 24 Nov 2020   AIC                          26700.287
## Time:                        11:40:08   BIC                          26724.545
## Sample:                             0   HQIC                         26708.987
##                                - 3179                                         
## Covariance Type:                  opg                                         
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## intercept      1.1366      1.144      0.993      0.321      -1.106       3.379
## ar.L1          0.9971      0.002    572.414      0.000       0.994       1.001
## ma.L1          0.0892      0.005     16.587      0.000       0.079       0.100
## sigma2       259.0152      0.418    619.397      0.000     258.196     259.835
## ===================================================================================
## Ljung-Box (Q):                       52.63   Jarque-Bera (JB):         120511061.15
## Prob(Q):                              0.09   Prob(JB):                         0.00
## Heteroskedasticity (H):               0.06   Skew:                           -22.97
## Prob(H) (two-sided):                  0.00   Kurtosis:                       955.73
## ===================================================================================
## 
## Warnings:
## [1] Covariance matrix calculated using the outer product of gradients (complex-step).
## """
ls0[0] # Indeed, differencing is not needed for the stock 'ADANIPORTS'.
## 0
subdf = df1[df1['Symbol'] == names[1]]
m2 = paim.auto_arima(subdf['VWAP'], start_p=1, start_q=1, test='adf',
                     max_p=3, max_q=3)
m2.summary() # The chosen model was ARIMA(0, 1, 0), which is a good fit.
# For 'ASIANPAINT', the 1-order difference series is close to a constant.
## <class 'statsmodels.iolib.summary.Summary'>
## """
##                                SARIMAX Results                                
## ==============================================================================
## Dep. Variable:                      y   No. Observations:                 5163
## Model:               SARIMAX(0, 1, 0)   Log Likelihood              -29072.936
## Date:                Tue, 24 Nov 2020   AIC                          58147.873
## Time:                        11:40:10   BIC                          58154.422
## Sample:                             0   HQIC                         58150.164
##                                - 5163                                         
## Covariance Type:                  opg                                         
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## sigma2      4564.8605      1.982   2303.040      0.000    4560.976    4568.745
## ===================================================================================
## Ljung-Box (Q):                       42.98   Jarque-Bera (JB):        3632346684.27
## Prob(Q):                              0.34   Prob(JB):                         0.00
## Heteroskedasticity (H):               4.65   Skew:                           -60.55
## Prob(H) (two-sided):                  0.00   Kurtosis:                      4110.73
## ===================================================================================
## 
## Warnings:
## [1] Covariance matrix calculated using the outer product of gradients (complex-step).
## """
subdf = df1[df1['Symbol'] == names[7]]
m3 = paim.auto_arima(subdf['VWAP'], start_p=1, start_q=1, test='adf',
                     max_p=3, max_q=3, error_action='ignore')
m3.summary() # The chosen model was ARIMA(1, 1, 2), which is a good fit.
# For 'BPCL' second order MA term is needed.
## <class 'statsmodels.iolib.summary.Summary'>
## """
##                                SARIMAX Results                                
## ==============================================================================
## Dep. Variable:                      y   No. Observations:                 5163
## Model:               SARIMAX(1, 1, 2)   Log Likelihood              -21047.069
## Date:                Tue, 24 Nov 2020   AIC                          42102.139
## Time:                        11:40:33   BIC                          42128.335
## Sample:                             0   HQIC                         42111.305
##                                - 5163                                         
## Covariance Type:                  opg                                         
## ==============================================================================
##                  coef    std err          z      P>|z|      [0.025      0.975]
## ------------------------------------------------------------------------------
## ar.L1          0.9703      0.019     51.948      0.000       0.934       1.007
## ma.L1         -0.9021      0.020    -45.739      0.000      -0.941      -0.863
## ma.L2         -0.0779      0.008    -10.104      0.000      -0.093      -0.063
## sigma2       203.7130      0.511    398.595      0.000     202.711     204.715
## ===================================================================================
## Ljung-Box (Q):                       31.14   Jarque-Bera (JB):          97447842.12
## Prob(Q):                              0.84   Prob(JB):                         0.00
## Heteroskedasticity (H):               5.01   Skew:                           -18.28
## Prob(H) (two-sided):                  0.00   Kurtosis:                       675.11
## ===================================================================================
## 
## Warnings:
## [1] Covariance matrix calculated using the outer product of gradients (complex-step).
## """

We can improve each of the models individually for slightly better forecast performance.

R

Set up

The main packages used in the R tutorial are:

  • dplyr: For data manipulation
  • ggplot2: For constructing plots
  • gridExtra: For arranging plots
  • forecast: For fitting ARIMA models & forecasting

Data visualization

# plot trend of all stocks
fig.cap1 = "**Figure 1.** *Daily trend of all stocks, 2000-2020.*"
nifty_ts = reshape2::melt(nifty_clean[, -c(2)], id.vars = "Date")
ggplot(nifty_ts, aes(x = Date, y = value)) + 
    geom_line(color= "deepskyblue4") + 
    theme_bw() +
    facet_wrap(~ variable, scales = "free_y", ncol = 1)

Figure 1. Daily trend of all stocks, 2000-2020.

From the trend of all stocks, we can see that the time series exhibit non-stationarity. There was a substantial shock to the Indian stock market after the outbreak of the coronavirus.

Determine model parameters

The R tutorial focuses on the variables Symbol, VWAP, Volume, Turnover, and a newly created variable Return, which is the difference between Close and Prev Close. First we choose one stock, “ADANIPORTS”, and analyze the ACF/PACF of the above four variables. Normally, the choice of p and q in ARIMA(p, d, q) depends on ACF/PACF plots. The trend plot above shows huge volatility in VWAP, Volume, and Turnover, so we take a log transformation to stabilize them. The functions for generating ACF/PACF plots are ggAcf() and ggPacf(), both from the forecast package. Alternatively, you can use the plot() S3 method on acf() output.

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
candidate = "ADANIPORTS"
vars_list = c("VWAP", "Volume", "Turnover", "Return")

nifty_cand = nifty_clean %>% 
  filter(Symbol == candidate) %>% 
  mutate_at(vars(matches(c("VWAP", "Volume", "Turnover"))), log)

# plot ACF
fig.cap2 = "**Figure 2.** *ACF plots for stock: ADANIPORTS.*"

acf_list = vector(mode = "list", length = length(vars_list))
names(acf_list) = vars_list
for ( var in vars_list ) {
  acf_list[[var]] = forecast::ggAcf(nifty_cand[[var]], lag.max = 60) + 
    ggtitle(var) + 
    theme_bw()
}

do.call("grid.arrange", c(acf_list, ncol = 2))

Figure 2. ACF plots for stock: ADANIPORTS.

Notice that all the variables show high autocorrelation except for Return. This is because Return is calculated as the first difference of the closing price, which acts as a linear filter that eliminates the trend. Since the ARIMA model only works for stationary time series, let’s take the first difference of the other three variables and compare the autocorrelation plots to the previous ones. Later, we will apply the auto.arima() function in R, which handles non-stationary time series by applying an appropriate number of differences to detrend the data.

fig.cap3 = paste("**Figure 3.** *ACF plots for stock: ADANIPORTS.*",
                 "VWAP, Volume, and Turnover after first differencing.")
do.call("grid.arrange", c(acf_diff_list, ncol = 2))

Figure 3. ACF plots for stock: ADANIPORTS. VWAP, Volume, and Turnover after first differencing.

fig.cap4 = paste("**Figure 4.** *PACF plots for stock: ADANIPORTS.*",
                 "VWAP, Volume, and Turnover after first differencing.")

do.call("grid.arrange", c(pacf_list, ncol = 2))

Figure 4. PACF plots for stock: ADANIPORTS. VWAP, Volume, and Turnover after first differencing.

Looking at the ACF/PACF gives a general idea of which p and q values to choose. Take VWAP for example: both plots show significant spikes that end at the second lag, suggesting that ARIMA(2, 1, 2) might be suitable. However, such eyeballing is not precise, and for a variable like Return it is tricky to find p and q from the ACF/PACF plots alone. As a result, we can use the AIC as the criterion for choosing p and q.

Fit models

Let’s tabulate some AIC values for a range of choices of p and q, assuming d = 0 for Return and d = 1 for the other three variables. We will hold out the last 30 observations as test data. Below is the AIC table from fitting ARIMA models to the Return series of the stock “ADANIPORTS”.

aic_table = function(ts, P, Q, d){ 
  table = matrix(NA, (P + 1), (Q + 1)) 
  for(p in 0:P) { 
    for(q in 0:Q) { 
      table[p + 1, q + 1] <- arima(ts, order=c(p, d, q))$aic
    } 
  }
  dimnames(table) = list(paste("AR", 0:P, sep = ""), 
                          paste("MA", 0:Q, sep = ""))
  table
}

# Construct AIC table
nifty_cand_ts = ts(nifty_cand$Return, frequency = 1, start = c(2000, 01, 03))
nifty_aic_table = aic_table(head(nifty_cand_ts, -30), 4, 4, 0) 
## Warning in arima(ts, order = c(p, d, q)): possible convergence problem: optim
## gave code = 1

## Warning in arima(ts, order = c(p, d, q)): possible convergence problem: optim
## gave code = 1

## Warning in arima(ts, order = c(p, d, q)): possible convergence problem: optim
## gave code = 1

## Warning in arima(ts, order = c(p, d, q)): possible convergence problem: optim
## gave code = 1

## Warning in arima(ts, order = c(p, d, q)): possible convergence problem: optim
## gave code = 1
tab.cap2 = '**Table 2**. *AIC for different ARIMA parameters*'
nifty_aic_table %>%
  knitr::kable(format = 'html', caption = tab.cap2) %>%
  kableExtra::kable_styling('striped', full_width = TRUE) 
Table 2. AIC for different ARIMA parameters
MA0 MA1 MA2 MA3 MA4
AR0 27514.98 27516.98 27518.81 27518.31 27518.78
AR1 27516.98 27518.98 27520.96 27518.11 27519.28
AR2 27518.83 27520.66 27514.92 27518.32 27477.74
AR3 27517.96 27517.47 27478.51 27518.30 27520.65
AR4 27518.41 27519.87 27518.41 27472.75 27514.69

The AIC table suggests that ARIMA(4, 0, 3), with the smallest AIC, is the best model for the return of “ADANIPORTS”. The table also suggests that increasing p and q tends to yield a smaller AIC and a better fit. However, models with higher p and q are more complex, which can lead to problems such as overfitting and numerical instability. We usually prefer a simpler model, which is also easier to interpret.

Even though it is instructive to see how the AIC changes with p and q, for a big data set like this it is inefficient to iterate over a grid of p and q. auto.arima() in the forecast package generates results much faster. It uses a variant of the Hyndman-Khandakar algorithm, which combines unit root tests, minimization of the AICc, and maximum likelihood estimation. Running auto.arima() on the training data selects the order (4, 0, 2).

ts_arima = auto.arima(head(nifty_cand_ts, -30), max.p = 4, 
                      max.q = 4, max.d = 3)
print(ts_arima)
## Series: head(nifty_cand_ts, -30) 
## ARIMA(4,0,2) with zero mean 
## 
## Coefficients:
##          ar1      ar2     ar3     ar4      ma1     ma2
##       1.4754  -0.7572  0.0017  0.0138  -1.4765  0.7656
## s.e.  0.2919   0.1707  0.0413  0.0362   0.2924  0.1669
## 
## sigma^2 estimated as 364.2:  log likelihood=-13751.28
## AIC=27516.55   AICc=27516.59   BIC=27558.94

The return equation can be written as: \[X_t = 1.475 X_{t-1}-0.757X_{t-2}+0.002X_{t-3}+0.014X_{t-4} +\varepsilon_t-1.477\varepsilon_{t-1}+0.766\varepsilon_{t-2}\]

Model Diagnosis

Lastly, we will test our model by forecasting the next 30 observations and comparing the result with our test set.

ts_forecasts = forecast(ts_arima, h = 30) 
acc = accuracy(ts_forecasts, head(tail(nifty_cand_ts, 30), 7))
print(round(acc, 4))
##                   ME    RMSE    MAE     MPE    MAPE   MASE    ACF1 Theil's U
## Training set -0.0247 19.0664 6.8923     NaN     Inf 0.7046 -0.0016        NA
## Test set      0.4030  5.6660 4.8247 96.1783 96.1783 0.4933 -0.3315    0.8505

The RMSE and MAE for the test set are 5.6660 and 4.8247, respectively. Furthermore, we can plot the residual diagnostics of our forecast.

fig.cap5 = "**Figure 5.** *Residual Diagnosis*"

p1 = autoplot(ts_forecasts, main = "") + xlab("Day") + 
  ggtitle("Residuals of Forecast") +
  ylab("Return") +
  theme_bw()
p2 = ggAcf(resid(ts_arima)) + ggtitle("ACF of residuals") +
  theme_bw()

grid.arrange(p1, p2, ncol = 2)

Figure 5. Residual Diagnosis

Unfortunately, the residuals do not appear normal; their distribution is heavy-tailed. Since the ARIMA model we applied takes the MLE approach under Gaussian assumptions, our data clearly violate them. The ACF of the residuals indicates that correlation remains in the residual series; thus, our model fails to account for all available information.

One way to improve the model is to take a log transform of the data. A second way is to fit an ARIMA model with t-distributed errors instead of assuming Gaussian white noise. A third way is to use data segmentation that takes interventions into consideration, since stock data are often affected by government policy.

Model Improvement

Let’s take the log transformation of Close and Prev Close prior to calculating the Return.

nifty_cand = nifty %>% 
  filter(Symbol == candidate) %>% 
  mutate_at(vars(matches(c("Close", "Prev Close"))), log) %>% 
  mutate(Return = Close - `Prev Close`)

nifty_cand_ts = ts(nifty_cand$Return, frequency = 1, start = c(2000, 01, 03))

ts_log_arima = auto.arima(head(nifty_cand_ts, -30), max.p = 4, 
                      max.q = 4, max.d = 3)

print(ts_log_arima)
## Series: head(nifty_cand_ts, -30) 
## ARIMA(1,0,1) with zero mean 
## 
## Coefficients:
##           ar1     ma1
##       -0.0951  0.0900
## s.e.   2.7882  2.7899
## 
## sigma^2 estimated as 0.001757:  log likelihood=5521.64
## AIC=-11037.29   AICc=-11037.28   BIC=-11019.12

auto.arima() suggests ARIMA(1, 0, 1) is the best fit for log returns.

fig.cap6 = "**Figure 6.** *Residual Diagnosis of log returns.*"

grid.arrange(p1, p2, ncol = 2)

Figure 6. Residual Diagnosis of log returns.

After taking the log transformation, there appears to be no significant correlation in the residual series, and the variance of the residuals stays roughly constant apart from two outliers. Consequently, we can now be more confident in the model forecasts, which appear to account for all available information, although prediction intervals computed under a normality assumption may still be inaccurate.

Stata

Data visualization

First, we visualize the data, taking the stock “ADANIPORTS” as an example.

use NIFTY_clean, clear

keep if symbol == "ADANIPORTS"

graph twoway line vwap date, color("blue") xtitle("Days") ///
  ytitle("Volume weighted average price")
graph export vwap_data.png, replace
graph twoway line volume date, color("blue") xtitle("Days") ytitle("Volume")
graph export volume_data.png, replace
graph twoway line turnover date, color("blue") xtitle("Days") ///
  ytitle("Turnover")
graph export turnover_data.png, replace

Figure 4.1. Time series plots of ADANIPORTS

Determine model parameters

We will use the time series VWAP for the analysis below.

For all stocks, we do Augmented Dickey-Fuller tests to determine whether the time series are stationary or not.

use NIFTY_clean, clear

levelsof symbol, local(sbls)

foreach sym of local sbls {
    use NIFTY_clean, clear
    keep if symbol == "`sym'"
    tsset date
    dfuller d1.vwap
}

We run the test on the first-order differenced series. All stocks report p-values small enough to reject the unit-root null, hence we decide to use \(d=1\) for all stocks.

Then, in order to find the AR parameter \(p\) and MA parameter \(q\) of the model, we generate the partial autocorrelation (PACF) plots together with the autocorrelation (ACF) plots.

Note: we will only plot the first 8 stocks as an example.

use NIFTY_clean, clear

levelsof(symbol), local(sbls)

local sbls_f8 ADANIPORTS ASIANPAINT AXISBANK BAJAJ-AUTO BAJAJFINSV ///
  BAJFINANCE BHARTIARTL BPCL

foreach sym of local sbls_f8 {
    use NIFTY_clean, clear
    keep if symbol == "`sym'"
    tsset date
    ac D.vwap
    graph export acf_`sym'.png
    pac D.vwap
    graph export pacf_`sym'.png
}

The ACF and PACF plots for the first four stocks are the following:


Figure 4.2. ACF and PACF plots of ADANIPORTS

And the plots for the next four stocks are the following:

ACF and PACF plots for the next four stocks.

We reach a similar conclusion for all stocks: lag 1 is clearly significant while lag 2 is not, hence we can choose \(p=1\) for the AR term and \(q=1\) for the MA term for all stocks.

Fit models

According to the process above, we choose \(ARIMA(1, 1, 1)\) for all stocks. However, diagnostics show that \(ARIMA(1, 1, 0)\) sometimes performs better for some stocks. Hence, we fit whichever model is better for each stock and then plot the predicted values against the original values.

Note: we will only plot the first 5 stocks as an example.

use NIFTY_clean, clear

local sbls_f5 = "ADANIPORTS ASIANPAINT AXISBANK BAJAJ-AUTO BAJAJFINSV"

foreach sym of local sbls_f5 {
    use NIFTY_clean, clear
    keep if symbol == "`sym'"
    tsset date
    arima vwap, arima(1,1,1)
    estat ic
    mat l_aim = r(S)
    scalar aic_aim = l_aim[1,5]
    arima vwap, arima(1,1,0)
    estat ic
    mat l_ai = r(S)
    scalar aic_ai = l_ai[1,5]
    if aic_aim > aic_ai {
        arima vwap, arima(1,1,0)
        predict vwap_pd
        gen vwap_p = vwap_pd + vwap
        graph twoway line vwap date, lwidth("vthin") color("blue") ///
          || line vwap_p date, lwidth("vthin") color("red") lpattern("dash")
        graph export fitted_`sym'.png
    } 
    else {
        arima vwap, arima(1,1,1)
        predict vwap_pd
        gen vwap_p = vwap_pd + vwap
        graph twoway line vwap date, lwidth("vthin") color("blue") ///
          || line vwap_p date, lwidth("vthin") color("red") lpattern("dash")
        graph export fitted_`sym'.png
    }
}

The sample fitted graphs are:

Fitted values against original VWAP for the first five stocks.

Model improvement

Now that we have chosen different models for different stocks, we can further improve the fit by selecting the most appropriate model for each stock.

However, Stata does not have a function similar to auto_arima that chooses models automatically. Hence, we may turn to the other two languages (Python, R); replicating such a search in Stata requires heavy and tedious computation.

Conclusion

To conclude, in this tutorial we covered applying the ARIMA model to forecast stock-related variables using Python, R, and Stata. We also cross-validated our results against actual data and suggested a model improvement method. Among the three programming languages, Python and R are very powerful for modeling time series data, since they implement the auto_arima()/auto.arima() functions, which select appropriate values for p, d, and q automatically. For Stata, determining the parameters mostly relies on inspecting ACF/PACF plots with trial and error. However, we should not blindly rely on automatic procedures. It is worthwhile to know how changes in p, d, and q affect the long-term forecasts as well as the prediction intervals.

References

  1. A modern Time Series tutorial

  2. ARIMA model, Wikipedia

  3. Augmented Dickey-Fuller test, Wikipedia